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Abstract: We present a weighted-LASSO method to infer the parameters 
of a first-order vector auto-regressive model that describes time course 
expression data generated by directed gene-to-gene regulation networks. 
These networks are assumed to own prior internal structures of connectiv- 
ity which drive the inference method. This prior structure can be either 
derived from prior biological knowledge or inferred by the method itself. 
We illustrate the performance of this structure-based penalization both on 
synthetic data and on two canonical regulatory networks, first yeast cell 
cycle regulation network by analyzing Spellman et al's dataset and second 
E. coli S.O.S. DNA repair network by analysing U. Alon's lab data. 
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1. Introduction 

Along the dozen of years of statistical studies related to microarrays for gene ex- 
pression profiling, conditional dependency has been recognized as an appropriate 
statistical tool to model direct interactions between genes. Graph representation 
suits well such relationships between variables. As a consequence GGMs (Gaus- 
sian Graphical Models) have been widely studied by statisticians, particularly 
those looking for applications to the reconstruction of gene-to-gene regulation 
networks (See e.g. Schafer and Strimmer 2005, Mcinshausen and Biihlmann 

2006, Wille and Biihlmann 2006, Castelo and Roverato 2006, Drton and Pcrlman 

2007, Shimamura et al. 2007). In the context of transcriptomic data, the main 
statistical issue paradoxically relies on the scarcity of data: despite a shrinking 
cost, microarrays still provide dataset that fall into the high-dimensional set- 
ting. Namely, the number of variables (the p genes) remains greater than the 
sample size n (the number of microarray slides) . 

In the Gaussian independent identically distributed (hereafter i.d.d.) setting, 
each microarray experiment is considered as a realization of a Gaussian vector 
whose dependency structure is fully determined by its covariance matrix. It can 
be shown that non-null conditional dependencies between genes are described 
by nonzero entries of the inverse of the covariance matrix (Dempster 1972). 
Thus, inferring this matrix is equivalent to recovering the graph of interest, 
which is not trivial when n is smaller than or of the order of p. To handle 
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the data scarcity, methods based upon fj-norm are very popular: they answer 
to both questions of regularization and of variable selection by selecting the 
most significant edges between genes in the network. In the i.i.d setting, t\- 
pcnalizcd maximum likelihood Gaussian covariance estimation has been first 
investigated by Yuan and Lin (2007) and Banerjce et al. (2008) independently. 
These methods provide sparse graph estimates, sparsity being a characteristic 
of gene-to-gene regulation networks. 

Looking for an improvement of these methods regarding the biological con- 
text, we provided in Ambroise ct al. (2009) a method that not only looks for 
sparse solutions, but also for an internal structure of the network that drives 
the inference. Indeed, biological networks and particularly gene regulation net- 
works are known not only to be sparse, but also organized, so as nodes belong to 
different classes of connectivity. Thus, we suggested a criterion that takes this 
heterogeneity into account. This leads to a better inference when networks are 
highly structured. Note that Marlin et al. (2009) published subsequently an in- 
dependent paper providing a similar method in a Bayesian framework. In these 
two papers, the internal structure considered relies on affiliation networks. That 
is, genes are clustered into groups that share the same connectivity patterns. 
This can be seen as the analogous to the group-LASSO (Yuan and Lin 2006) 
applied to a graphical context. 

Finally, some authors (e.g. Opgcn-Rhcin and Strimmer 2007, Lcbrc 2009, 
Shimamura et al. 2009) underlined that transcriptomic dataset are not i.i.d. 
when considering time course expression data. Assuming a first-order vector 
auto-regressive (VAR1) model for the time course data generation, they pro- 
vided inference methods handling high-dimensional settings: Opgcn-Rhcin and 
Strimmer suggested a shrinkage estimate while Lebrc performed statistical tests 
on limited-order partial correlations to select significant edges. In a recent work, 
Shimamura ct al. (2009) proposed to deal with this VAR1 setup by combining 
ideas from two major developments of the LASSO to define the Recursive elastic- 
net. As an elastic-net (Zou and T. 2005), this method adds an £2 penalty to the 
original l\ regularization, thus encouraging the simultaneous selection of highly 
correlated covariates on top of the automatic selection process due to the i\ 
norm. As in the adaptive LASSO (Zou 2006), weights are corrected on the basis 
of a former estimate so as to adapt the regularization parameter to the relative 
importance of coefficients. Note that, in this context, we are no longer looking 
for an estimate of the inverse of the covariance matrix but of the parameters of 
the VAR1 model, which leads to a directed graph. 

In this paper, we aim to couple the time course data modeling by the VAR1 
model to an ^-regularizing approach that takes the internal structure of the 
network into account. This internal structure does not rely on an affiliation 
structure anymore since graphs inferred from time course data display a com- 
pletely asymmetrical pattern. The internal structure adopted here splits the 
genes into two groups: a group of hubs that exhibit a high connection proba- 
bility to all other genes and a group of leaves that only receive edges leaving 
from the hub class. This information can either be inferred as seen in this paper 
or recovered from biological expertise since recovering hubs consists roughly in 
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exhibiting transcription factors in regulatory networks, a large number of them 
being already identified by the biologists. 

Another refinement of our method is to built on the adaptive- LASSO (Zou 
2006, Zhou et al. 2009) which is known to reduce false positive rate compared 
to the classical LASSO. As such, our method belongs to the larger family of 
weighted-LASSO methods. Shimamura ct al. (2007) built upon Meinshausen 
and Biihlmann's neighborhood selection and the adaptive LASSO to improve 
inference of networks in an i.i.d. context. They chose separate penalties for each 
node's neighborhood selection problem and adapted each individual penalty co- 
efficient to the information brought by an initial ridge estimate. Here, we suggest 
to lower the bias of the LASSO by not only using information from an initial 
statistical inference but also from prior knowledge about the topology of the 
network that assumes the existence of genes with high connection probability 
to other genes. 

The rest of the paper is organized as follows: in the next Section, the VAR1 
model and the associated likelihood function are briefly recalled; an ^-penalized 
criterion is proposed where each parameter of the VAR1 model, representing the 
graph of interest, is weighted according to its belonging to the hub group. The 
weights can also depend on a previous estimate just as in the adaptive-LASSO. 
We also briefly recall available tools to guide the choice of the regularization 
parameter. In Section 3, the inference procedure is detailed: we present how 
the internal structure can be recovered; from that point, network inference re- 
duces to a convex optimization problem which we solve through an active-set 
algorithm based upon the approach of Osborne ct al. (2000). Finally, an exper- 
imental Section investigates the performances of the method. First, simulated 
data are considered; then, we try to recover edges implied in two different reg- 
ulation processes. First in yeast cell cycle, by analyzing the Spellman's dataset 
and comparing the selected edges to the direct regulations collected from the 
Yeastract database; second in E. coli, by analyzing U. Alon's precise kinetic 
data on S.O.S. DNA repair subnetwork. 

Remark. The code will soon be embedded in the R package SIMoNe. (Chiquet 
et al. 2009). 

2. Modeling Heterogeneous Regulation Networks from Time Course 
Data 

2.1. Auto-regressive Model and Sparse Networks 

Let V — {1, 2, . . . ,p} be the set of variables of interest, e.g., some p genes. Let us 
denote by (X t )teti the M p -valued stochastic process that represents the discrete- 
time evolution of the gene expression levels, written as a row vector. Also denote 
by X\ the expression level of gene i at time t and X t the expression level of 
all genes but i at time t. Herein, X t is assumed to be generated by a first-order 
vector auto-regressive (VAR1) model 



X t = Xt-iA + b + et, teN*, 
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where A = (Aij)i j e -p is anpxp matrix, b is a size-p row vector and s t is a white 
Gaussian process. Namely, e t ~ A/"(0, D) where D is a diagonal matrix such as 
Dm = of and cov(et,e s ) = l{ s=t }D for all s,t > 0. Moreover, X ~ A/"(^, X), 
with /i a size-p vector of means and S a covariance matrix. Also assume that 
cov(X t , e s ) = for all s > t: hence, X t is obviously a first-order Markov process. 

Since the covariance matrix D is diagonal, each entry is directly linked 
to the partial correlation coefficient between variables X\ and X\_ x . In fact, 

_zw(xlX\_Axi^) 

™(x\_ x \x)l^ 

thus nonzero entries of A code for the adjacency matrix of a directed graph 
describing the conditional dependencies between the elements of V. Inferring A 
is equivalent to reconstructing this graph and is the main issue of this paper. 

To this end, let us set up the estimation framework: assume that X t is ob- 
served on the time space t = 0, 1, . . . , n. Denote by X the (n+1) x p matrix of 
available centered, scaled to unit- variance data, whose t th row contains the infor- 
mation relative to the p variables at time t. The empirical variance-covariance 
matrix S and the empirical temporal covariance matrix V are then given by 

S = ^ X \" X \"' V = ~ X \n X \0' 

where X\t denotes matrix X deprived of its fc th row. 

The well-known maximum likelihood estimator (MLE) of A is easily recov- 
ered and recalled in the following proposition. 

Proposition 1. Maximizing the log-likelihood of the VAR1 process is equiva- 
lent to the following maximization problem 

max |Tr(V T A) - ^Tr(A T SA)| , 

whose solution is given by 

A mlc = S _1 V. (1) 

Remark. Thanks to the assumptions we made on e the VAR1 model can be 
seen as a usual regression problem: denote by X p (respectively Xj) the n first 
(respectively last) rows of X. A ols is naturally given by (XTX p ) _1 XJX/ = 
S _1 V = A mle . The MLE (1) is straightforwardly equivalent to the ordinary 
least square estimate (OLS) of A. 

Solution (1) requires a covariance matrix S that is invertible, which occurs 
when S is at least of rank p. In real situations the number of observations is often 
about or lower than the number of variables, thus MLE needs to be regularized. 
Regularization such as Moore-Penrose pseudo inversion or £i-regularization can 
be applied on matrix S in order to make the inversion always achievable. A 
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sharpest approach is investigated in Opgcn-Rhein and Strimmer (2007), where 
the OLS solution is regularized by shrinking both matrices S and V. 

We suggest to draw inspiration from the ^i-penalizcd likelihood approach 
developed by Banerjee et al. (2008) in the case of i.i.d. samples of a multivariate 
Gaussian distribution: here, samples are no longer i.i.d yet linked through time 
by the VAR1 model. Still, the sparsity can be controlled with a positive scalar 
p adjoined to an £i-norm penalty on A by solving 



Since MLE and OLS are equivalent in this framework, solution to the penalized- 
likelihood formulation (2) is equivalent to solving p independent LASSO problems 
on each column of A, which is exactly Mcinshausen and Biihlmann's approach. 
The difference is that it does not require any post-symmetrization since there 
is no symmetry constraint on A in the present context. 

2.2. A Structured Modeling of the Network 

To attempt a better fit of data, we suggest that A owns an internal structure 
that describes classes of connectivity between the variables. Indeed, the i^-norm 
regularization encourages a first restriction on the network's topology inferred 
through criteria (2), by encouraging sparsity. Yet, it is well known that by 
penalizing truly significant entries of A as much as truly zero entries a single 
t\ penalization leads to biased estimates and a particularly strong number of 
false positives (Knight and Fu 2000, Zou 2006). Weighted-LASSO approaches 
can lower this bias by adapting penalties to prior information about where the 
true zero entries should be, relying on possibly data-driven as well as biological 
information. An existing correction is given by the Adaptive-LASSO (Zou 2006, 
Zhou et al. 2009). Penalty coefficients are alleviated or increased using individual 
weights reversely proportional to an initial estimate A mlt . 

The main purpose of this paper is to show the interest of taking into account 
information about the topology of the network: not only should we scale coef- 
ficients individually, but also consider the underlying organization of V . Adap- 
tation of weights is made by providing A with a well-chosen prior distribution, 
relying on the organization of V . We assume that genes are spread through a 
partition of V into Q classes of connectivity. Both existences and weights of 
edges, described by the elements of A, depend on the connectivity class each 
vertex belongs to. Denote by Zi q the indicator function that gene i belongs to 
class q. Each entry Ay; Zi q Zj£ — 1 is provided with an independent prior distri- 
bution fijqi- Following Ambroise et al. (2009), we choose Laplace distributions 
for fij q e since it is the corresponding log-prior distribution to the £ i term in the 
LASSO. Hence, by choosing 




(2) 
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where Xij q i are scaling parameters, we expect a model whose log-likelihood will 
naturally make a specific ^-penalization term appear. 

Modeling hubs. Many configurations fit into this general model. In Ambroise 
ct al. (2009) we focused on an affiliation model. This structure opposes intra 
to inter-cluster connections, assuming the former to be far more likely than the 
latter. In the present context, where dynamic regulatory networks are repre- 
sented by directed graphs, the affiliation model unnaturally assumes symmetric 
probabilities for "incoming" and "outgoing" edges and should be banished. In- 
deed, adjacency matrices associated to directed gene regulatory networks are 
asymmetrical: genes belong to two completely different groups. While a group 
of hubs exhibits a high connection probability to all other genes, the remaining 
set of genes only receives edges leaving from the first class. Illustration of this 
phenomenon by Spellman ct al. (1998) 's dataset on Saccharomyces cerevisiae is 
presented in Section 4. This setup can be summarized as follows: 



fijql — 



/hub(-;A hu b) if q is the hub class, 
/leaf (•; A lcaf ) if q is not the hub class. 



Note this structure only differentiate edges on the basis of their origin, whether 
they leave from a hub or not, whatever be their arrival points. In this type of 
structure built around hubs, the number of classes is fixed at 2. 

Allowing for individual prior information about i and j, this model can be 
generalized to 



/hub (•; Ay A hub ) if q is the hub class, 
/leaf (•; Kj Aieaf) if q is not the hub class. 



The likelihood. As the matrix A has been given a prior distribution, our 
aim is to maximize the posterior probability of A, given the data X. For a fixed 
structure Z, this is equivalent to maximizing the joint probability 

A = argmax logP(X,A;Z). 

A 

Now, the likelihood P(X, A; Z) is straightforwardly given by 

logP(X, A; Z) - Tr (V* A) - ^Tr (A^SA) - ||P Z * A|| €l + c, (3) 



where c is a constant term and the p x p penalty matrix is defined by 



Practically, we obtain the following penalty 

~Pfj = Ay 1 • (X huh Zi t hub + A leaf Z;,ieaf) = P ■ Pij ■ (pimb^i,hub + Pleaf ^i.leaf) ; 
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where p > is a common factor to XT . and Aj~* f , which can vary so as to 
adapt the penalty while the ratio Aj^/Aj"^ = Phub/pieaf > 1 remains constant 
at a chosen level. Coefficient can be held fixed at 1 when no individual 
information is taken into account or replaced by any well-chosen transformation 
of an initial estimate of A in order to provide accurate information on where 
true zeros might be. 



2.3. Tuning the penalty parameter 

We briefly recall here the different techniques available in the literature. Asymp- 
totic theory of the LASSO demonstrates that a penalty parameter of order y/n 
guarantees both estimation consistency and selection consistency: asymptoti- 
cally, estimation is unbiaiscd and all relevant covariates arc included in the model 
with strictly positive probability (Knight and Fu 2000). In practice, this does 
not tell us which penalty to use for a fixed sample size n. To solve this problem, 
Tibshirani (1996) suggests the use of cross-validation. However it is well-known 
that the best penalty for prediction is not the best penalty for model selection 
purposes. Cross-validation is therefore unrevelant here. Optimality in terms of 
selection naturally draws attention towards penalties which would in some way 
control the false discovery rate. Closest to that goal are penalty choices which 
guarantee a control over the probability of connecting two nodes by a chain of 
edges though no such path exists in the true graph. Such penalties have been 
discussed in Bancrjee et al. (2008), Meinshausen and Biihlmann (2006) or Am- 
broise et al. (2009) for instance. However, as underlined in the latter, this kind of 
penalty is often much too conservative to be used as anything else than an upper 
bound on the set of interesting penalties. Relying on the Bayesian interpreta- 
tion of the LASSO, another option is to maximize the marginal probability of the 
data over all possible tuning parameters. A specific approximation for graphs 
is derived in Shimamura et al. (2007). Taking into account the fact that the 
number of degrees of freedom of the LASSO equals the final number of nonzero 
parameters (Zou et al. 2007) computations get a lot easier. Particularly, the BIC 
approximation of the marginal distribution as well as the AIC criterion, whose 
good properties for model selection are well-known, are trivial to compute. In 
our case, we obtain the following expressions: 



BICp = n 



AICp = n 



Tr ( V T A„ ) - -Tr (AISA, 



2 V P " 
Tr ( V^Apj - ^ Tr (aJSA p 



^dfp, 
2 p 



df P , 



where A p denotes the estimate of A associated to penalty p and df p the number 



of nonzero entries in A p . 



In practice, we observe that these last criteria present both advantages of 
being straightforward to compute and of providing impressively sensible results 
in terms of both Recall and Precision rates. We therefore adopt these criteria 
to select two best penalties to choose from in the remaining of the paper. 



Charbonnier, Chiquet, Ambroise/ 'Structured Dynamic Regulation Network inference 8 



3. Inference Strategy 

3.1. Structure Inference 

In many application fields, the structure can be considered as known, learned 
from expert knowledge. In genetic for instance, biologists can often extract the 
list of transcription factors from the overall set of target genes. 

Otherwise, the structure, or part of it, could remain latent: we suggest a basic 
strategy that performs well practically for biological networks. In this context, 
the structure goes down to the identification of hubs. To this purpose we suggest 
a very intuitive path. A first matrix A is estimated using an adequate single 
LASSO penalty. We rely on AIC and BIC criteria to identify the best initial 
penalty. Nodes are then classified into two groups, hubs and leaves, according to 
the values of the £i-norms of the corresponding rows in Ao. In order to account 
for the particularly strong heterogeneity between the two groups (differences 
in size and dispersion), a Gaussian mixture approach is used for clustering the 
genes. This defines two submatrices Aj and Aq containing respectively the lines 
corresponding to the first and second groups. Hubs are then characterized as 
the class with the maximum mean absolute value of Aq. 

3.2. Active-Set algorithm for Network Inference 

Once the internal structure has been recovered, inference of A amounts to op- 
timizing the penalized likelihood (3) where Z are fixed parameters. This can be 
achieved by solving some p independent LASSO-style problems since there is no 
symmetry constraint on A: denoting by M fc the k th column of a given matrix 
M, we wish to solve for each column of A the following minimization problem 

A fe = argmini(/3), where L(f3) — ^/3 T S/3 — [3 J ~V k + ||A * /3||^ , (4) 

f3 Z 

where A = (P z ) fc for clarity purpose. 

Solving penalized problem (4) can be achieved through various algorithms. 
The elegant active-set approach suggested in Osborne et al. (2000) takes ad- 
vantage of the sparsity of (3 to solve the equivalent constrained problem: start- 
ing from p as an initial guess, the set of active variables A = {i : /?; ^ 0} is 
updated at various stages of the algorithm so as we solve linear systems with 
limited sizes to determine the current nonzero coefficients denoted by /3_4 herein. 
The algorithm stops once the optimality conditions derived from the classical 
Karush-Kuhn- Tucker conditions are satisfied. In the next paragraph, we detail 
an adaptation to the present context of the Osborne ct al., initially developed 
for the LASSO for linear regression. 

The objective function L in (4) is convex, yet not differentiable everywhere 
due to the ^-norm: from convex analysis, (3 is solution to (4) iif p belongs to 
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the subdiffcrcntial of L, which mainly forms the optimality conditions of the 
problem. Here, the subdifferential is given by 

dpL(P) = S0 - V fc + A * 0, 

where 9 £ sign(/3), that is, Qi = sign (/%) if i £ A, and 6{ S [—1, 1] if i G A. 

Starting from — p , we select the component I of whose subgradicnt 
absolute value is maximal: as a matter of fact, a subgradient highly different 
from zero induces high violation of the optimality conditions. Such a choice will 
guarantee a large reduction of the objective function L during the optimization 
procedure. Thus, this component is added to the active set A = A U {£}. 

Then, optimization is only performed on nonzero coefficients 0a whose cardi- 
nal is small since the solution is likely to be sparse. This is done by minimizing 
L(@a), which reduces to a classical optimization problem because the subdiffer- 
ential turns to an usual gradient V^L on the active set A. 

While optimizing, the next update 0t = Pa + h is obtained by solving 
VhL(@A + h) = 0|_4|, which leads to the following descent direction 

h = -P A + $a,A ( V A - A.4 * sign(/34 + h)) . 

However sign(/3_4 + h) cannot be known while computing h and is consequently 
approximated by the current sign of 0a equal to 6a- 

h^-0A + s^ A (y A -A A *e A ). 

Due to this approximation, we check for sign-consistency between the candi- 
date update 0a + h and 9a- In case of inconsistency, the descent direction is 
reduced so as 0a + 7h is sign consistent with 9a- This ends the optimization 
part of the algorithm. 

Then, the active set A is updated since some 0i could have been set to zero 
during the optimization procedure: this is done by looking for vanished 0iS, 
verifying dpL(0i) = 0. Finally, optimality conditions are tested: if the maximal 
t of the subdifferential corresponding to an unactivated component of is zero, 
we have found a solution; otherwise, the active set is updated by adding I to A, 
since it induces the highest reduction of L. 

These three steps — optimization, deactivation and optimality testing - are 
repeated until a solution has been found, which is guaranteed (see Osborne et al. 
2000). The full algorithm is detailed below. Note that it can either start from 
= p or from a solution obtain from a more penalized problem with larger 
vector of penalties A, that speeds up the computation, hence having a behavior 
that is similar to the homotopy/LARS algorithm (Efron ct al. 2004). 

The full matrix A is directly recovered by binding column- wisely the solutions 
to the p LASSO-style problems. 

Remark. With this method, the sparsity constraint only applies to each column 
of A. This constraint implies that if we use n + 1 time points, S is of rank n 
and thus no more than n connections can be activated by the LASSO at most in 
each column (assuming the penalty is low enough to accept the activation of all 
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Algorithm 1: Active-set algorithm 

//INITIALIZATION 

P^P°,A^{i:pi ^O},0^sign(/3) 
while P g do 

111. OPTIMIZATION OVER A 

1 11.1 Compute the (approximate) direction h 

h = -^ + s^ iA (vi-A A *^) 

111.2 Check for sign consistency 
if sign(/3^ + h) ^ dj, then 

//Find a solution which is sign-feasible 

7, k <— arg min < 7 <i {7, fe £ .4 : (S k + 7^ fc = 0} 

Pa <- Pa + 7h 

else 

L /3^^/3^ + h 

//2. LOOK FOR NEWLY ZEROED VARIABLES 

for i : ^ = and min e6sign(/3i ) \dg.L(/3i)\ = do 

L A^A\{i} 

113. OPTIMALITY TESTING 

// Select t providing the highest reduction of L 

£ <- argmax ie _^ !/», where !/» = min 6)esign((3 . j |9 ft L(ft)| 

if = then 

■ Stop and return /3 
else 

I Update the active set: A = A U {<?} 
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possible edges). Consequently, the sparsity constraint only applies to incoming 
edges and not to outgoing ones. In that sense, sparsity assumptions implied by 
l\ penalization only assume that each node is regulated by a small set of nodes 
and do not contradict the existence of hubs regulating a huge set of nodes. 

4. Experiments and Discussion 

In this section we apply our algorithm to both synthetic and real data. Com- 
parison is made first within the family of the weighted-LASSO. We observe the 
performances of the LASSO when associated with a single LASSO penalty or an 
adaptive penalty. For the adaptive-LASSO, a single LASSO penalty is used as 
initial estimator. We then try two different hub penalties: one relying only on 
the known hub structure and a last one inferring the hub structure from the 
initial LASSO estimator. We denote these estimators by Lasso, Adaptive, KnwCl, 
and InfCl respectively. Corresponding penalties can be summarized as follows: 



pAdaptivc a | ^_ l | 

Pij™ 1 a (Phub-^Uiub + Ploaf^ijcaf) 
plnfCl a ^p huh Z i hub + pi ca fZij ca fj , 

where x V y — max{x, y} and Z denotes the inferred classification. In the re- 
mainder of this section, we fix the ratio pieaf/phub = 2, thus penalizing twice as 
much nodes labeled as leaves as nodes labeled as hubs. Note also that we choose 
to maintain the modification of adaptive weights adopted in Zhou ct al. (2009) 
and prevent the alleviation of penalty parameters. This trick ensures that the 
adaptive LASSO will select a subnetwork from the network inferred by the ini- 
tial LASSO estimate. No edge can be included if it was already excluded by the 
LASSO. In this way, the adaptive LASSO guarantees a decrease in false positives. 

Apart from our family of weighted-LASSO proposals, comparison will be made 
with state-of-the art network inference methods in a VAR1 setting: the Shrink- 
age method suggested by Opgcn-Rhcin and Strimmer (2007), the Recursive 
Elastic Net method (Renet-VAR) developed by Shimamura et al. (2009) and 
the method based on dynamic Bayesian networks proposed by Lebre (2009) 
and available in R within the G1DBN package. 

Here, the interest of the inference lies in the recovery of the true edges, in 
other words of whether the entries of A are correctly identified as nonzero. 
Our estimators are mainly used for discriminating nonzero entries from others. 
Quantities such as True Positives (TP), False Positives (FP), True Negatives 
(TN) and False Negatives (FN) summarize the performances of these classifiers. 
Precision TP/(TP + FP) is the ratio of the number of true nonzero elements 
to the total number of nonzero elements in the estimated matrix A. Recall 
TP/ (TP + FN) denotes the proportion of nonzero elements in A which were 



1 
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correctly recovered as nonzero in the estimation. Fallout FP/ (FP+TN) gives on 
the contrary the proportion of zero elements in A which were falsely declared as 
nonzero in the estimation. In statistical terms, the Recall (or Hit Rate) would 
be the empirical equivalent of the power of our classification method considered 
as a test, while the Fallout (or False Alarm Rate) would correspond to the 
first type a error. Note that, in the context of sparse network inference, the 
number of total positives is small compared to the number of total negatives. 
Thus, small variations of FP and TP will induce small variations in Fallout and 
large variations in Recall. Hence, comparison between Precision and Recall is 
generally more relevant than Fallout / Recall comparison in the present sparse 
context. This is why we will generally choose to omit Fallout rates when we 
need to alleviate the presentation of results. 

These rates are easily obtained for the LASSO based methods since they 
automatically produce null coefficients. By increasing the penalty parameter we 
obtain sparser and sparser graphs. We start from a large enough penalty to 
constrain all coefficients of A to and decrease the penalty until we include 
as many variables as allowed by the ratio n/p. We then select the best penalty 
from this list as the one maximizing either the BIC or the AIC criterion. 

Like the LASSO, Renet- VAR directly implements variable selection and penalty 
choice is included in the algorithm. Concerning G1DBN, we follow the author's 
advice to tune the parameters of the test procedure as described in the addi- 
tional material of Lebre (2009). When applying the Shrinkage method devel- 
opped by Opgen-Rhcin and Strimmcr (2007), a supplementary step is required 
to transform continuous results into a binary solution. We follow Opgen-Rhein 
and Strimmcr's advice and rely on local false discovery rates. This provides each 
edge with an existence probability conditional on the corresponding entry in A. 
We declare as inferred edge any edge with posterior probability exceeding the 
threshold of 80% as the authors do. 

4-1- Simulated Data 

Simulation settings. To assess the performances of our approach, we apply 
the previous model to a very favorable setup, where existing models already 
perform quite well. We then decrease the ratio n/p in order to observe the 
response of each method to this increasing lack of information. On top of that, 
we consider graphs of different sizes: small graphs of 20 nodes, larger graphs of 
100 nodes and a setup with 800 nodes. For smaller graphs, we consider three 
different amounts of observations: 10, 20 and 40. For medium sized graphs, 
we also consider the cases n = p/2 and n — p but omit the case n — 2p as 
unrealistic. The setup p = 800, n = 20 is meant to mimic Spcllman ct al.'s 
dataset. 

Simulation of the VAR1 process is based upon the simulation strategy used 
by Opgen-Rhein and Strimmcr in order to ease the comparisons, but introduces 
a structure based on hubs in order to better reflect the structure we could 
expect from a real data set. A graph is first simulated, with fixed numbers of 
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nodes and edges. Like Opgen-Rhein and Strimmer we simulate sparse graphs, 
with K — 2p edges. Nodes are split into two groups according to a multinomial 
distribution with probabilities (0.1,0.9), leading to 10% of hubs in average. Edges 
are then positioned in the graph according to a multinomial distribution, with 
85% of edges from hubs to leafs, and the remaining set within hubs. Exception 
is made for the very large graph, for which we base the number of edges and 
their distribution on Spcllman ct al.'s data. The matrix A is synthesized on the 
basis of this graph: we attribute a random partial correlation value uniformly 
distributed on [—1, —0.2] U [0.2, 1] to all nonzero coefficients (corresponding to 
edges in the graph). 

From this matrix, a VAR1 observation is generated, using a centered Gaus- 
sian starting value and a centered Gaussian noise, both with variance a 2 = 0.1. 
For computing time reasons, this is repeated 500 times for the small graphs, 200 
times for medium sized graphs and 100 times for the large graph. Results are 
averaged over all samples. 

To gain a better insight into the difficulty of these synthesized data set for 
a LASSO estimator, we checked whether the irrepresentability condition (Zhao 
and Yu 2006, Meinshausen and Yu 2008) was validated in all these very simple 
simulations. First, note that the graphical context requires the irrepresentability 
condition to be validated for each of the p genes at the same time, which makes 
it much more difficult to hold than in the simple regression context where it 
is an already strong hypothesis. In our context, since we solve p independent 
LASSO problems, we can check the validity of the hypothesis in each of these 
individual problems. For each gene, the irrepresentability condition is tested 
using the true sign pattern extracted from the corresponding column of the 
true adjacency matrix. Thus the sets of relevant and irrelevant covariates are 
allowed to vary from one problem to another. Simulating 100 samples of each 
simulation setting, we observed that even in a favorable setup with twice as many 
observations as variables (p = 20 genes) the irrepresentability condition fails for 
30% of genes in average . With p = 20 genes and only n = 10 observations this 
assumption fails in average for 51% of the genes. In other words, for around 
half of the genes we cannot expect the LASSO to recover the exact sign pattern. 
See Table 1 for details. Admittedly, the irrepresentability condition is a really 
strong assumption, necessary and sufficient for exact sign recovery, that is to 
say not only the exact neighborhoods (no false positives, no false negatives) but 
also the exact signs of the correlations. Yet since the simulated values are quite 
well separated between true zeros and true nonzeros we would have expected 
that this hypothesis would have been much more validated. Information about 
the validity of the restricted eigen-value assumptions (Bickcl ct al. 2009) would 
be greatly appreciated to compensate for such pessimistic results, but these are 
computationally intractable. Adaptation of Juditsky and Ncmirovsky (2008) 's 
results to the present context could be of great benefit. 

Discussion of simulation results. Results are presented in Figure 1 under 
the form of Barcharts. Figure 2 illustrates the case where p = 100 by giving 
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n/p \ 


20 


100 


2 


0.30 (0.23) 




1 


0.41 (0.23) 


0.37 (0.15) 


1/2 


0.51 (0.18) 


0.42 (0.12) 




Table 1 





Average proportion of genes for which the irrepresentability condition does not hold and 
standard error in each simulation setting. 

boxplots for the distributions of Precision, Recall and Fallout. 

Compared methods differ with the type of setting. First of all, since the 
Shrinkage method (particularly the local false discovery rate step) relies on the 
hypothesis that p is large, we do not consider it fair to apply it to the small 
network setting. Reversely, for computing time reasons we decided to restrict 
the application of G1DBN to the graphs of size p = 20. 

Penalties for the LASSO based methods were chosen on the basis of either the 
BIC or AIC criterion. Although theory states that the BIC ought to outperform 
the AIC in terms of model selection (Zou et al. 2007), we observed that in prac- 
tice the BIC criterion might be too conservative when n is small compared to p. 
In that situation, it might be interesting to favor the less stringent AIC criterion 
which will induce a higher recall rate for not such a large loss in precision. Note 
that the penalty choice based on the AIC or the BIC can lead to choose the 
null model as best model. In that case, Precision cannot be defined. We thus 
show the results for precision over all simulations where at least one variable 
was included. 

The first point worth noting in Figure 1 is that in all settings the LASSO is 
outperformed by weighted-LASSO methods and others. This quick check con- 
firms the interest of compensating for the bias induced by i\ regularization on 
large coefficients. It is also possible that what we observe about the validity 
of the irrepresentability condition jeopardizes the performances of the single- 
penalty LASSO. In line with Table 1, the LASSO performs particularly badly 
when the ratio n/p is not favorable, with recall and precision rates under 20% 
when p = 20, n — 10. It even performs so poorly that it deprecates the infer- 
ence based on adaptive weights. A priori information on where the true ze- 
ros might compensate for this apparent lack of "neighborhood stability" , using 
Mcinshausen and Biihlmann's vocabulary, and explain why the KnwC'l penalty 
is far more accurate (precision of 84% in average for a recall of nearly 50% in 
average for the same simulation setting p — 20, n = 10). 

As expected, in all settings (except when n is really too small compared to 
p) the Adaptive penalty improves the precision but at the price of a smaller 
recall rate. On the contrary, the inferred classification InfCl allows to improve 
the precision without undermining the recall rate. However, both methods are 
highly dependent on the initial LASSO estimate. Therefore, the gain in precision 
resulting from such methods decreases with the n/p ratio. 

Benefitting from a certain amount of supplementary information, the KnwCl 
penalty leads to a clear increase in both precision and recall. Particularly when 
little information is available in terms of number of observations, taking a priori 
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information about which genes are potential regulators and which are not into 
account improves the results dramatically. This is true when compared to all 
LASSO based methods but generalizes to Shrinkage, Renet-VAR and G1DBN. 
Admittedly, Renet-VAR leads to higher precision values with medium sized 
graphs, but it is compensated by smaller recall rates. 

Table 1 shows naturally that we cannot expect too much from very extreme 
settings (p = 800, n — 20, that is, the Spellman ct al.'s settings). Average Recall 
rate is less than 20% for all methods except the KnwCl penalty. In this case, 
knowledge of potential hubs allows the recall rate to almost double in average 
while increasing the precision. Note however that even with this supplementary 
information precision rates never exceed 50%. 
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Fig 1. Bar charts of Precision and Recall rates for each method and simulation setting, 
averaged over all simulation samples. 



To finish with, we would like to lay the emphasis on computing times. For 
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Fig 2. Boxplots of Precision, Recall and Fallout statistics for all methods except Shrinkage 
in a setup p = 100, for 200 simulated data sets. Best Lasso penalties chosen on the basis of 
the BIC criterion. 



this we let the number of nodes range from 5 to 185 and fixed the number of 
observations at half the maximum number of nodes, i.e. n = 92. This leads to a 
ratio n/p ranging from 0.05 to 2. Computing times for the weighted LASSO with 
inference of the classification InfCI and selection of the best penalty, Renet- 
VAR and G1DBN are presented in the log-log scale in Figure 3. We can see 
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that running times for Renet- VAR and G1DBN can become a handicap as soon 
as p gets large while computing times for InfC'l rarely exceed 2 minutes. 
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Fig 3. Computing times on the log-log scale for Renet-VAR, G1DBN and InfCI (including 
inference of classes). Intel Dual Core 3-40 GHz processor. 



4.2. Yeast Data 

We confronted our model to time measurements of Saccharomyces cerevisiae 
gene expression data collected by Spellman et al. (1998). We focus on the subset 
of genes they identified as periodic, i.e. genes whose transcription levels over time 
show evidence that they are cell-cycle regulated. 

Remarks on the data set. This dataset is one of the first microarray ex- 
periments. It is thus doomed to be rather noisy, contrary to the simulated data 
sets. Besides, we had to face the problem of missing values, which appeared on 
some of the most important genes. We imputed them as the mean of the two 
closer known observations in time for the gene considered, before and after the 
time point of interest. 

On top of its noisiness, Spellman et al.'s data set is particularly hard to tackle 
from a statistical view point. Information is provided on 786 genes for only 18 
time points. This implies that using our algorithm we cannot activate more than 
17 * 786 = 13362 edges out of 789 * 786 = 617796 possible ones, that is to say 
2.2%. 

However, we can rely on experimental conclusions on yeast gene regulation 
networks to collect target information about the true edges of the graph. We 
compare our results to the adjacency matrix provided by the Yeastract database 
(www.yeastract.com). We retain information on documented direct relation- 
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ships, that is to say direct regulations confirmed by published experimental 
results. 

Note however that this theoretical benchmark is biased in two ways. First, 
some true edges might be missing because all regulations might not have been 
confirmed by experiments yet. Second, this graph gathers all reported regu- 
lations, whatever the conditions of the experiment. Some might not actually 
happen during the precise experiment we consider. We can suppose the effect of 
the first bias to be low in a model organism such as Saccharomyces cerevisiae. 
The effect of the second bias is much more likely however, since measurements 
are all made while cells are at the beginning of their growth, growing until ready 
for DNA synthesis. We cannot expect the whole range of possible regulations to 
happen in such a small portion of the cell cycle. 

This dataset illustrates quite well the biological properties our model is based 
upon. First, documented information reveals the existence of 1385 true edges 
(among more than 600000 possible ones in theory). The theoretical graph is 
thus extremely sparse. Secondly, the hub structure is quite clear: edges leave 
from only 26 out of 786 genes. Hence knowledge of the hubs provides crucial 
information on the position of edges. This phenomenon also clearly appears on 
Figure 4. Incoming degrees never exceed 20 but only 1 is null. On the contrary, 
outgoing degrees are null for the vast majority of genes. Significant degrees 
appear as outliers in this distribution, reaching up to 150 for some of them. 



Fig 4. Boxplots of incoming and outgoing degrees in Yeast theoretical adjacency matrix 



Discussion of the results. The setting is much harder than in the first 
simulated data sets, with a ratio n/p = 2.3% as well as harder than the last 
simulated dataset with less separated correlations between existing and non 
existing edges. Results presented in Table 2 show quite well the difficulty all 
methods encounter in front of this data set. Results for the Shrinkage approach 
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are not shown because the local false discovery rate step included in this method 
was heavily flawed by the lack of separability between edges and non edges. 
Except for the KnwCl penalty all LASSO based estimators are reduced to the 
null model. Both the BIC and AIC criteria do not find the increase in likelihood 
large enough to compensate for the complexity of any model with at least one 
edge. Performances of the KnwCl penalty and Renet-VAR remain lower than 
what we could expect from simulated results. 



Models 


Lasso 


Adaptive 


KnownCl 


InferCl 


Renet 


Precision 






0.082 




0.004 


Recall 








0.068 





0.003 


Fallout 








0.002 





0.002 



Table 2 

Precision, Recall and Fallout performances for all Lasso based methods and Renet- VAR on 
Spellrnan et al. 's data set. Best Lasso penalties chosen on the basis of the BIC criterion. 

Many reasons for such bad perfomances could be thought of. We already 
mentionned the noisiness of the data, which quite hardly differentiated the edges 
from non edges. Second, homogeneity of the VAR(l) model might be too strong 
an assumption. Last but not least, when looking more closely at how data were 
collected we noticed that measurements were made every 7 minutes, which might 
be long enough for dependencies to vanish. Also, since we measure values related 
to the cell cycle, measurements were necessarily made on different cells each 
time, thus measuring the expression levels on different individuals at each time 
point. In brief, this apparently longitudinal data set might share more common 
points with i.i.d. models than with VAR1 processes. 

4-3. E. coli S.O.S. DNA repair network 

In this section we quit the high dimensional setup and compare the performances 
of all methods in a much easier framework. We focus on a sub-network from E. 
Coli S.O.S. DNA repair network analyzed by Roncn ct al. 1 . Data provide infor- 
mation on the main 8 genes of the S.O.S. network (uvrD,lexA,umuD,recA,uvrA,uvrY,ruvA,polB) 
across 50 time points. Measurements rely on precise expression kinetics which 
allow Ronen ct al. to monitor mRNA expression levels every 6 min after expo- 
sition of the DNA to UV light at time 0. We will not dwell on the measurement 
technology here (see Ronen et al. (2002) for details). Note however that the 
authors do not measure the actual mRNA quantity present in the cell at time 
t but the instant promoter activity of each gene. Equivalence between the two 
measurements is guaranteed if the instant quantity of mRNA in the cell roughly 
equals its production rate, that is to say if there is no accumulation of mRNA 
in the cell. Under this assumption, Ronen ct al. 's data can be used as any 
microarray dataset. 

E. coli S.O.S. DNA repair network provides a precise benchmark: specific 
regulatory interactions in response to DNA damage have been characterized. In 



data downloadable on Uri Alon's homepage, http://www.weizmann.ac.il/mcb/UriAlon/ 
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other words, we can rely on a theoretical regulatory network which represents 
the main direct transcriptory regulations actually taking place during the ex- 
periment. According to the regularly updated EcoCyc database, lexA is the only 
regulator in this subnetwork, regulating all genes including itself. Concretly, the 
protein LexA is at the core of the regulation network, usually binding sites in the 
promoter regions of S.O.S. genes to repress their expression. As soon as RecA 
senses DNA damage (by binding to single-stranded DNA) , it becomes activated 
and induces LexA autocleavage. The decrease in LexA concentration alleviates 
the repression of S.O.S. genes. When damage is repaired, the level of activated 
RecA drops, LexA accumulates and represses again all S.O.S. genes. 

Detailed results are presented in Figure 5. We can see that performances 
differ a lot from one experiment to another. Particularly, experiments 1 and 4 
lead to significantly poor results although nothing should a priori distinguish 
them from 2 or 3 (1 and 2, respectively 3 and 4, share the same U.V. exposure). 

As on simulated data, the LASSO leads to poor results. G1DBN shows simi- 
larly poor performances here. Quite surprisingly, Renet-VAR does not perform 
as well as we could have expected from simulations. It reaches 50% of recall 
at the expense of very low precision rates. Adaptive penalty improves more the 
quality of the estimation than in the simulation studies. Now they increase the 
precision of the LASSO without really undermining the recall rate. Inference of 
the classification outperforms these, with higher recall and precision rates. This 
is quite interesting since except in experiments 1 and 4 where the LASSO provide 
almost no information, inference of the classes seems quite good although the 
initial LASSO still shows mediocre results. To finish with, the KnwCl penalty 
benefits quite well here from its extra information since it outperforms all other 
methods and manages to reach honest results even in datasets 1 and 4 which 
disturbed all other methods. 

Inferred graphs on experiment 2 are shown in Figure 6. The regulatory activ- 
ity of lexA is more or less recovered by all methods. What is interesting is that 
a common structure recurently shows up among false positives: regulations due 
to uvrA. This regulation pattern is particularly what dominates experiment 4 
and leads to so poor results. Strangely, we could not find any mention of this 
regulatory activity in the literature. Either there is a need for further biological 
research on this gene or there is an undirect regulation blurring the results. 
Another unknown regulation dominates all inferred graphs: regulation of uvrY 
by polB. It is all the more interesting as it survives the bad a priori that the 
KnwCl penalty holds against it. Further biological investigation could want to 
look at this couple of genes more closely. 

In this respect, we could note that the regulatory effect of activated RecA 
on LexA does not appear on these graphs, which we could see as a good point 
since this is a post-transcriptional regulation. We would also like to lay the 
emphasis on the fact that we here check selection consistency of all the methods 
but not their sign consistency. We only check whether we identify the right 
edges and not the activation/inhibition processes associated to them. Looking 
more closely at the estimated matrices, we can see that the (shrunk) correlations 
estimated between lex A and the remaining genes are all positive and not negative 
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Fig 5. Summary of Precision, Recall and Fallout (respectively Prec, Rec. and Fall.) values 
for each method and experiment 



as the literature would tell. This would not be a flaw in all methods but a 
direct result of the limitations of transcriptomic data. Indeed, we only observe 
mRNA production rates. As a consequence, we cannot spot the decrease in 
concentration of protein LexA and only observe that the expression of all genes 
suddenly increases, lex A included. 



5. Conclusion 

This paper proposes a weighted-LASSO algorithm designed to tackle time varying 
gene expression data taking into account an underlying structure. We observe 
that in a perfect VAR1 setting, taking time dependencies into account leads to 
dramatically improved results for graph inference. In this particular framework, 
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Fig 6. Graphs inferred by the different methods on experiment 2 data. LASSO penalties are 
chosen so as to maximize the BIC criterion. True positives are drawn in black while false 
positives are shown in dashed gray. 

the proposed approach outperforms similar methods. Even when regulators and 
rcgulatccs cannot a priori been distinguished through analysis of the literature, 
inference of the classification improves a lot the performances of the LASSO. It 
therefore seems good to advice that, whenever available, knowledge about poten- 
tial transcription factors should be taken into account and that basic knowledge 
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on the topology of biological networks should not be omitted in the modeling 
process. We also want to emphasize the fact that this method reaches great 
results on networks of reasonable size for always reasonnablc computing times. 
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